Visualizing Air Pollution Disparities in New York City

Nick Sawhney

Overview:

Every New Yorker has had a day where the city air just didn't agree with them. Though air pollution in New York has steadily been getting better over the years, it still is impacting the health of many across the city. I wanted to analyze the sources of air pollution within the city and the disparity between sources and effects (edit later). I use the Air Quality dataset from NYC Open Data to visualize the disparities between pollution sources and their health effects. I'll be using the data analysis library pandas for the air quality data, and geopandas for geographic data analysis. For visualization, I'll be using matplotlib for static maps and folium for interactive maps.

Formatting the Data

Let's import the libraries we're using

In [1]:
import pandas as pd
import geopandas as gpd
import folium
import branca
from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

I'm using DOHMH's Air Quality dataset from NYC Open Data. Let's take a quick look at it.

In [15]:
air_data = pd.read_csv('Air_Quality.csv', encoding='latin1')
air_data.head()
Out[15]:
indicator_data_id indicator_id name Measure geo_type_name geo_entity_id geo_entity_name year_description data_valuemessage
0 130728 646 Air Toxics Concentrations- Average Benzene Con... Average Concentration Borough 1 Bronx 2005 2.8
1 130729 646 Air Toxics Concentrations- Average Benzene Con... Average Concentration Borough 2 Brooklyn 2005 2.8
2 130730 646 Air Toxics Concentrations- Average Benzene Con... Average Concentration Borough 3 Manhattan 2005 4.7
3 130731 646 Air Toxics Concentrations- Average Benzene Con... Average Concentration Borough 4 Queens 2005 1.9
4 130732 646 Air Toxics Concentrations- Average Benzene Con... Average Concentration Borough 5 Staten Island 2005 1.6

This dataset has a name, measure type, and year/time period, and region for each measure. I want to make sure that I'm only working with neighborhood-level data, so I create a column that combines all the details of the measurement into one description. That way, I can have a unique column for each measurement in each region. I also remove Boroguh and Citywide data from the set because we only care about neighborhood-level data for this experiement, and hospitalization data from 2005-2007, since our average pollutant levels are all 2009-2010

In [3]:
air_data = air_data[~air_data['name'].str.startswith('Traffic Density')]
air_data = air_data[~air_data['year_description'].str.contains('2005-2007')]
air_data = air_data[air_data['geo_type_name'] == 'UHF42'].reset_index()
air_data['pollutant'] = air_data['name'] + ', ' + air_data['year_description'] + ', ' + air_data['Measure']

Next, I'm going to import the geographic data I need to create the maps. I'm using GIS data provided by NYC DOH which shows each "United Health Fund" neighborhood's boundaries, which is the most granular regional designation in the Air Quality dataset.

First I clean the first row so pandas can handle the data, and I group the Air Quality dataset by region. I noticed that there were some small differences in how certain neighborhoods were written down (e.g. 'Crotona -Tremont' vs 'Crotona - Tremont'), so I use python's handy dictionary comprehensions and a pandas method to fix it. I also group the pollutant types for use later.

I merge the air quality dataset into the neighborhoods dataset based on the neighborhood.

In [4]:
neighborhoods = gpd.read_file('UHF_42_DOHMH_2009').dropna()
neighborhoods['UHF_NEIGH'][0] ='None'
neighborhoods = neighborhoods.set_index('UHF_NEIGH')
neighborhood_groups = air_data.groupby('geo_entity_name').groups
air_data = air_data.replace({x: y for x, y in zip(sorted(list(neighborhood_groups.keys())), sorted(list(neighborhoods.index))) if x!=y})
pollutant_groups = air_data.groupby('pollutant').groups
pollutants = [group for group in pollutant_groups]

for group in pollutant_groups:
    neighborhoods = pd.concat([neighborhoods, air_data.loc[pollutant_groups[group]].drop('index', axis=1).drop_duplicates().set_index('geo_entity_name')['data_valuemessage']], axis = 1, join_axes = [neighborhoods.index]).rename({'data_valuemessage': group}, axis=1)
    
neighborhoods = neighborhoods.reset_index()  

The final dataset should have everything we need to start exploring! As you can see, for each neighborhood we have geographic location and geometry data, as well as measurements for each pollutant. Now we can get to the fun part.

In [16]:
neighborhoods.head()
Out[16]:
UHF_NEIGH OBJECTID UHFCODE SHAPE_Leng SHAPE_Area BOROUGH geometry Air Toxics Concentrations- Average Benzene Concentrations, 2005, Average Concentration Air Toxics Concentrations- Average Formaldehyde Concentrations, 2005, Average Concentration Boiler Emissions- Total NOx Emissions, 2013, Per km2 ... O3-Attributable Asthma Hospitalizations , 2009-2011, Rate- 18 Yrs and Older O3-Attributable Asthma Hospitalizations , 2009-2011, Rate- Children 0 to 17 Yrs Old O3-Attributable Cardiac and Respiratory Deaths , 2009-2011, Rate PM2.5-Attributable Asthma ED Visits , 2009-2011, Rate- 18 Yrs and Older PM2.5-Attributable Asthma ED Visits , 2009-2011, Rate- Children 0 to 17 Yrs Old PM2.5-Attributable Cardiovascular Hospitalizations (Adults 40 Yrs and Older) , 2009-2011, Rate- 40 Years and Older PM2.5-Attributable Deaths , 2009-2011, Rate - Adults 30 Yrs and Older PM2.5-Attributable Respiratory Hospitalizations (Adults 20 Yrs and Older), 2009-2011, Rate- Adults 20 Yrs and Older avg_health_impact avg_concentration
0 Kingsbridge - Riverdale 2 101.0 57699.154353 1.332914e+08 Bronx POLYGON ((1017992.893460184 269222.9642282724,... 2.9 3.2 42.5 ... 7.5 22.3 8.6 25.4 60.7 18.3 77.6 18.6 0.463107 0.243182
1 Northeast Bronx 3 102.0 88219.319109 1.813708e+08 Bronx POLYGON ((1025012.990312353 270794.260298267, ... 2.8 3.2 33.8 ... 9.8 32.2 5.4 37.6 73.4 16.2 57.0 18.6 0.372490 0.205555
2 Fordham - Bronx Park 4 103.0 59711.871991 1.407724e+08 Bronx POLYGON ((1023994.479554102 261065.9674189389,... 2.7 3.2 71.0 ... 14.4 43.6 4.0 68.7 122.8 18.8 49.6 20.5 0.464138 0.372261
3 Pelham - Throgs Neck 5 104.0 250903.372273 3.865737e+08 Bronx (POLYGON ((1017075.038996696 237316.1822414398... 2.7 3.2 24.9 ... 11.7 34.2 4.5 57.5 121.9 19.3 54.5 19.7 0.468941 0.232744
4 Crotona - Tremont 6 105.0 66676.089072 1.068978e+08 Bronx POLYGON ((1007916.255148947 252530.7522059381,... 3.0 3.4 62.5 ... 18.9 39.6 2.7 131.3 200.2 21.6 47.4 25.1 0.668174 0.427734

5 rows × 30 columns

Lets Make Some Maps!

I'm going to need this helper function to add line breaks to the charts

In [6]:
#Thanks to Mohammad ElNesr on StackOverflow for this function!
def split_title_line(title_text, split_on='(', max_words=5):  # , max_words=None):
    """
    A function that splits any string based on specific character
    (returning it with the string), with maximum number of words on it
    """
    split_at = title_text.find (split_on)
    ti = title_text
    if split_at > 1:
        ti = ti.split (split_on)
        for i, tx in enumerate (ti[1:]):
            ti[i + 1] = split_on + tx
    if type (ti) == type ('text'):
        ti = [ti]
    for j, td in enumerate (ti):
        if td.find (split_on) > 0:
            pass
        else:
            tw = td.split ()
            t2 = []
            for i in range (0, len (tw), max_words):
                t2.append (' '.join (tw[i:max_words + i]))
            ti[j] = t2
    ti = [item for sublist in ti for item in sublist]
    ret_tex = []
    for j in range (len (ti)):
        for i in range(0, len(ti)-1, 2):
            if len (ti[i].split()) + len (ti[i+1].split ()) <= max_words:
                mrg = " ".join ([ti[i], ti[i+1]])
                ti = [mrg] + ti[2:]
                break

    if len (ti[-2].split ()) + len (ti[-1].split ()) <= max_words:
        mrg = " ".join ([ti[-2], ti[-1]])
        ti = ti[:-2] + [mrg]
    return '\n'.join (ti)

Thanks to geopandas, I can treat geographic data like any other dataframe, so it's relatively simple to plot heatmaps of all pollutants and health impacts. I inverted the color of Ozone Concentration because unlike other pollutants, it is the lack of Ozone that causes negative health impacts. Every map is shown below, and I'll be going into some interesting findings next!

Keep in mind, the colors you see on each map are all scaled differently because the indicators all have different impacts, i.e. a "high" concentration of CO2 is different than a "high" concentration of PM2.5 (a common industrial pollutant).

In [7]:
plt.rcParams['axes.titlesize'] = 20
fig, axs = plt.subplots(nrows=11, ncols=2, figsize=(30, 165), dpi=100)
for axis, pollutant in enumerate(pollutants):
    row = axis // 2
    col = axis % 2
    cm = 'viridis'
    if('Ozone' in pollutant and 'Concentration' in pollutant):
        cm = 'viridis_r'
    divider = make_axes_locatable(axs[row][col])
    cax = divider.append_axes("right", size="5%", pad=0.1)
    axs[row][col] = neighborhoods.plot(ax=axs[row][col], cmap = cm, column = pollutant, legend = True, cax=cax)
    axs[row][col].set_title(split_title_line(pollutant, max_words = 5))
    axs[row][col].axis('off')
axs[-1][-1].axis('off')
plt.savefig('figs/master.png') 
plt.show()

There's quite a bit of very interesting data here! It makes a lot of sense that the highest concentration of pollutant output or Ozone depletion happens in or around middle and lower Manhattan. These are areas with a lot of traffic congestion, close to the ConEd factory, and densely populated by people, industry, and vehicles. What's more interesting is when you compare the location of a pollutant to the location of its health effects.

Analysis

Not only do we have data on many different types of pollutants, but we also have data on hospitalizations, Emergency Room visits, and deaths related to low Ozone levels and PM2.5 (which refers to levels of very small particles in the air that come from a wide range of industrial and natural processes). PM2.5 in particular can cause many respiratory, cartiovascular, and immune system problems.

In [8]:
plt.rcParams['axes.titlesize'] = 20
pm25 = [6, 16, 18, 19]

#axs[0][0] = neighborhoods.plot(ax = axs[0][0], column = pollutants[6])
#axs[0][0].set_title(split_title_line(pollutants[6], max_words = 5))
#axs[0][0].axis('off')
def plot_four(pol):
    f, axs = plt.subplots(figsize = (30, 30), dpi = 100, squeeze = False, nrows = 2, ncols = 2)
    for i, p in enumerate(pol):
        #axs[i][0].axis('off')
        row = i // 2
        col = i % 2
        cm = 'viridis'
        if('Ozone' in pollutants[pol[i]] and 'Concentration' in pollutants[pol[i]]):
            cm = 'viridis_r'
        axs[row][col] = neighborhoods.plot(cmap = cm, ax = axs[row][col], column = pollutants[p], legend = True)
        axs[row][col].set_title(split_title_line(pollutants[pol[i]], max_words = 5))
        axs[row][col].axis('off')
    plt.show()

plot_four(pm25)

On the top left is neighborhood concentrations of PM2.5 patricles. The other maps show where people were hospitalized or passed way for any reason attributable to PM2.5. As you can see, there's quite a big disparity between the source-location of the pollutant and the areas most affected by it. There could be a variety of reasons for this. Maybe it's because many people in outer boroughs commute to industrial parts of Manhattan to work. It could be due to wind patterns, or socioeconomic circumstances limiting access to health. An even more stark disparity can be seen with Ozone levels. It could also be that these kinds of health problems simply do not happen enough to be statistically significant

In [9]:
oz = [i for i in range(9, 16, 2)]
plot_four(oz)

Aggregation and Interactive Visualization

Next, let's put all the pollution information together into an interactive map! Since each graph has a different scale, we need to figure out a way to put all the information together without losing data. We'll split the maps in two - one with the concentration of various pollutants, and one with the health effects.

In [10]:
pollutant_concentration = [pol for pol in pollutants if 'Attributable' not in pol and 'Ozone' not in pol]
health_effects = [pol for pol in pollutants if pol not in pollutant_concentration and ('Ozone' not in pol and 'O3' not in pol) ]

Next, we need to aggregate the data in a way that doesn't drown out the naturally lower ranges of concentration of some pollutants (for example, a 20% concentration of Sulfur Dioxide is a much more significant amount than a 20% concentration of Nitrogen Dioxide). Therefore, we will scale all the pollutants to be some value between 0 and 1, with 0 representing the lowest measured amount of that pollutant and 1 representing the highest. Then, we average all the normalized pollutant values. This is a nice way to get a good picture of the data as a whole, but remember that the coloring of the map is relative, showing the difference between the most and least polluted neighborhoods of New York, rather than the general amount of pollution in each neighborhood.

In [11]:
def normalize(df):
    return (df - df.min()) / (df.max() - df.min())
In [12]:
neighborhoods['avg_health_impact'] = sum(normalize(neighborhoods[column]) for column in health_effects)/len(health_effects)
neighborhoods['avg_concentration'] = sum([normalize(neighborhoods[column]) for column in pollutant_concentration])/len(pollutant_concentration)

Pollutant Concentration Map

Here is an interactive map of the average concentration that we just calculated.

In [13]:
m = folium.Map([40.730610, -73.935242], tiles = 'CartoDB positron')

m.add_child(folium.Choropleth(
        geo_data = neighborhoods.dropna(),
        name = 'pollution',
        data = neighborhoods.dropna(),
        columns = ['UHF_NEIGH', 'avg_concentration'],
        key_on='feature.properties.UHF_NEIGH',
        fill_color = 'YlOrRd',
        fill_opacity=.9,
        line_opacity=.4,
    )
)
Out[13]:
In [14]:
m = folium.Map([40.730610, -73.935242], tiles = 'CartoDB positron')

m.add_child(folium.Choropleth(
        geo_data = neighborhoods.dropna(),
        name = 'pollution',
        data = neighborhoods.dropna(),
        columns = ['UHF_NEIGH', 'avg_health_impact'],
        key_on='feature.properties.UHF_NEIGH',
        fill_color = 'BuPu',
        fill_opacity=.9,
        line_opacity=.4,
    )
)
Out[14]:

What have we learned?

With these two maps, we can see where there are relatively high levels of some kind of Air Pollutant, and in general which parts of the city feel the health effects of these pollutants the most. What happens in one part of New York surely affects the rest, often in ways that aren't obvious to us!